mmci(5)                                                              mmci(5)

NAME
     mmci - Memory Management Control Interface

DESCRIPTION
     This document describes the concepts and interfaces provided by IRIX
     for fine-tuning memory management policies for user applications.

   Policy Modules
     The ability of applications to control memory management becomes an
     essential feature in multiprocessors with a CCNUMA memory system
     architecture.  For most applications, the operating system is capable
     of producing reasonable levels of locality via dynamic page migration
     and replication; however, in order to maximize performance, some
     applications may need finely tuned memory management policies.

     IRIX provides a Memory Management Control Interface (MMCI) based on
     the specification of policies for the different kinds of operations
     executed by the Virtual Memory (VM) Management System.  Users may
     select a policy from a set of available policies for each of these VM
     operations.  Any portion of a virtual address space, down to the
     level of a page, may be connected to a specific policy via a Policy
     Module.

     A Policy Module or PM contains the policy methods used to handle each
     of the operations shown in the table below.

     MEMORY OPERATION     POLICY              DESCRIPTION
     _______________________________________________________________________
     Initial Allocation   Placement Policy    Determines what physical
                                              memory node to use when
                                              memory is allocated
                          Page Size Policy    Determines what virtual page
                                              size to use to map physical
                                              memory
                          Fallback Policy     Determines the relative
                                              importance between placement
                                              and page size
     _______________________________________________________________________
     Dynamic Relocation   Migration Policy    Determines the
                                              aggressiveness of memory
                                              migration
                          Replication Policy  Determines the
                                              aggressiveness of replication
     _______________________________________________________________________
     Paging               Paging Policy       Determines the
                                              aggressiveness and domain of
                                              memory paging
     _______________________________________________________________________

     When the operating system needs to execute an operation to manage a
     section of a process's address space, it uses the methods specified
     by the Policy Module connected (attached) to that section.

     To allocate a physical page, the VM system's physical memory
     allocator first calls the method provided by the Placement Policy
     that determines where the page should be allocated from.  Internally,
     this method returns a handle identifying the node memory should be
     allocated from.  The Placement Policy is described in detail later in
     this document.

     Second, the physical memory allocator determines the page size to be
     used for the current allocation.  This page size is acquired using a
     method provided by the Page Size Policy.

     Now, knowing both the source node and the page size, the physical
     memory allocator calls a per-node memory allocator, specifying both
     parameters.  If the system finds memory on this node that meets the
     page size requirement, the allocation operation finishes
     successfully; if not, the operation fails, and a fallback method
     specified by the Fallback Policy is called.  The fallback method
     provided by this policy decides whether to try the same page size on
     a different node, a smaller page size on the same source node, sleep,
     or just fail.

     Which Fallback Policy to choose depends on the kind of memory access
     patterns an application exhibits.
     If the application tends to generate many cache misses, giving
     locality precedence over the page size may make sense; otherwise,
     especially if the application's working set is large but has
     reasonable cache behavior, giving the page size higher precedence may
     make sense.

     Once a page has been placed, it stays on its source node until it is
     either migrated to a different node, or paged out and faulted back
     in.  Whether a page may be migrated is determined by the Migration
     Policy.  For some applications, those that present a very uniform
     memory access pattern from beginning to end, initial placement may be
     sufficient and migration can be turned off; on the other hand,
     applications with phase changes can really benefit from some level of
     dynamic migration, which has the effect of attracting memory to the
     nodes where it is being used.

     Read-only text can also be replicated.  The degree of replication of
     text is determined by the Replication Policy.  Text shared by many
     processes running on different nodes may benefit substantially from
     several replicas, which both provide high locality and minimize
     interconnect contention.  For example, /bin/sh may be a good
     candidate to replicate on several nodes, whereas a program such as
     /bin/bc really does not need much replication at all.

     Finally, all paging activity is controlled by the Paging Policy.
     When a page is about to be evicted, the pager uses the Paging Policy
     methods in the corresponding PM to determine whether the page can
     really be stolen or not.  Further, this policy also controls page
     replacement.

     The current version of IRIX provides the policies shown in the table
     below.

     POLICY TYPE         POLICY NAME           ARGUMENTS
     __________________________________________________________________
     Placement Policy    PlacementDefault      Number of threads
                         PlacementFixed        Memory Locality Domain
                         PlacementFirstTouch   No arguments
                         PlacementRoundRobin   Round robin MLDset
                         PlacementThreadLocal  Application MLDset
                         PlacementCacheColor   Memory Locality Domain
     __________________________________________________________________
     Fallback Policy     FallbackDefault       No arguments
                         FallbackLargepage     No arguments
     __________________________________________________________________
     Replication Policy  ReplicationDefault    No arguments
                         ReplicationOne        No arguments
     __________________________________________________________________
     Migration Policy    MigrationDefault      No arguments
                         MigrationControl      migr_policy_uparms_t
                         MigrationRefcnt       No arguments
     __________________________________________________________________
     Paging Policy       PagingDefault         No arguments
     __________________________________________________________________
     Page Size Policy    -                     Page size
     __________________________________________________________________

     The following list briefly describes each policy.

     PlacementDefault
          This policy automatically creates and places an MLD for every
          two processes in a process group on an Origin 2000.  On an
          Origin 3000, it creates and places an MLD for every four
          processes in a process group.  The number of processes is
          provided as the argument to the policy.  Each process's memory
          affinity link (the memory affinity hint used by the process
          scheduler) is automatically set to the MLD created on behalf of
          the process.  The MLDs estimate a memory affinity hint based on
          the size of the currently running process's address space.
          Memory is allocated by referencing the MLD being used as the
          memory affinity link for the currently running process.  With
          this policy the application does not need to create and place
          MLDs or an MLDset.
     PlacementFixed
          This policy requires a placed MLD to be passed as an argument.
          All memory allocation is done on the node where the MLD has been
          placed.

     PlacementFirstTouch
          This policy starts with the creation of one MLD, placed on the
          node where the creation happened.  All memory allocation is done
          on the node where the MLD has been placed.

     PlacementRoundRobin
          This policy requires a placed MLDset to be passed as an
          argument.  Memory allocation happens in a round robin fashion
          over each of the MLDs in the MLDset.  The policy maintains a
          round robin pointer that points to the next MLD to be used for
          memory allocation; it is advanced to the next MLD in the MLDset
          after every successful memory allocation.  Note that the round
          robin operation is done over time, not over space: successive
          allocations cycle through the MLDs regardless of the virtual
          addresses involved.

     PlacementThreadLocal
          This policy requires a placed MLDset to be passed as an
          argument.  The application has to set the affinity links for all
          processes in the process group.  Memory is allocated using the
          MLD being used as the memory affinity link for the currently
          running process.

     PlacementCacheColor
          This policy requires a placed MLD to be passed as an argument.
          The application is responsible for setting the memory affinity
          links.  Memory is allocated using the specified MLD, with
          careful attention to cache coloring relative to the Policy
          Module instead of the global virtual address space.

     FallbackDefault
          The default fallback policy gives priority to locality.  The
          system first tries to allocate a base page (16KB on Origin
          systems) on the requested node.  If no memory is available on
          that node, it borrows from some close neighbor, following a
          spiral search path.

     FallbackLargepage
          When this fallback policy is selected, priority is given to the
          page size.  The system first tries to allocate a page of the
          requested size on a nearby node, and falls back to a base page
          only if a page of this size is not available on any node in the
          system.

     ReplicationDefault
          When this policy is selected, read-only pages are replicated
          following the Coverage Radius algorithm described in
          replication(5).

     ReplicationOne
          Force the system to use only one replica.

     MigrationDefault
          When this default migration policy is selected, migration
          behaves as explained in migration(5), according to the tunable
          parameters also described in migration(5).

     MigrationControl
          Users can select different migration parameters when using this
          policy.  It takes an argument of type migr_policy_uparms_t,
          shown below.

              typedef struct migr_policy_uparms {
                      __uint64_t  migr_base_enabled        :1,
                                  migr_base_threshold      :8,
                                  migr_freeze_enabled      :1,
                                  migr_freeze_threshold    :8,
                                  migr_melt_enabled        :1,
                                  migr_melt_threshold      :8,
                                  migr_enqonfail_enabled   :1,
                                  migr_dampening_enabled   :1,
                                  migr_dampening_factor    :8,
                                  migr_refcnt_enabled      :1;
              } migr_policy_uparms_t;

          This structure allows users to override the default migration
          parameters defined in /var/sysgen/mtune/numa and described in
          migration(5).

          - migr_base_enabled enables (1) or disables (0) migration.
          - migr_base_threshold defines the migration threshold.
          - migr_freeze_enabled enables (1) or disables (0) freezing.
          - migr_freeze_threshold defines the freezing threshold.
          - migr_melt_enabled enables (1) or disables (0) melting.
          - migr_melt_threshold defines the melting threshold.
          - migr_enqonfail_enabled is a no-op for IRIX 6.5 and earlier.
          - migr_dampening_enabled enables (1) or disables (0) dampening.
          - migr_dampening_factor defines the dampening factor.
          - migr_refcnt_enabled enables (1) or disables (0) extended
            reference counters.
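          For illustration, a minimal sketch of how such an argument might
          be filled in (the threshold value is an arbitrary placeholder,
          not a recommendation; see migration(5) for meaningful settings;
          memset is assumed from <string.h>):

              migr_policy_uparms_t migr_parms;

              /* Zero all bit fields, then enable plain migration with
               * an illustrative threshold; freezing, melting, dampening
               * and extended reference counters stay disabled.
               */
              memset(&migr_parms, 0, sizeof(migr_parms));
              migr_parms.migr_base_enabled   = 1;
              migr_parms.migr_base_threshold = 50;

          The address of such a structure is then passed as the migration
          policy argument when the Policy Module is created with
          pm_create, described below.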
     MigrationRefcnt
          This policy turns migration completely off (for the associated
          section of virtual address space) and enables the extended
          reference counters.  No arguments are needed.

     PagingDefault
          This is currently the only available paging policy.  It is the
          usual IRIX paging policy.

     Page Size
          Users can select any of the page sizes supported by the
          processor being used.  For Origin 2000 systems the allowed sizes
          are 16KB, 64KB, 256KB, 1024KB (1MB), 4096KB (4MB), and 16384KB
          (16MB).

   Creation of Policy Modules
     A Policy Module can be created using the following Memory Management
     Control Interface call:

         typedef struct policy_set {
                 char*   placement_policy_name;
                 void*   placement_policy_args;
                 char*   fallback_policy_name;
                 void*   fallback_policy_args;
                 char*   replication_policy_name;
                 void*   replication_policy_args;
                 char*   migration_policy_name;
                 void*   migration_policy_args;
                 char*   paging_policy_name;
                 void*   paging_policy_args;
                 size_t  page_size;
                 short   page_wait_timeout;
                 short   policy_flags;
         } policy_set_t;

         pmo_handle_t pm_create(policy_set_t* policy_set);

     The policy_set_t structure contains all the data required to create a
     Policy Module.  For each selectable policy listed above, this
     structure contains a field to specify the name of the selected policy
     and the list of possible arguments that the selected policy may
     require.  The Page Size Policy is the exception, for which specifying
     the desired page size suffices.

     Pages of larger sizes reduce TLB miss overhead and can improve the
     performance of applications with large working sets.  Like other
     system resources, large pages are not guaranteed to be available in
     the system when the application makes the request.  The application
     has two choices: it can either wait for a specified timeout or use a
     smaller page size.  The page_wait_timeout field specifies the number
     of seconds a process can wait for a page of the requested size to
     become available.  If the timeout value is zero, or if a page of the
     requested size is not available even after waiting for the specified
     timeout, the system uses a page of a smaller size.

     The policy_flags field allows users to specify special behaviors that
     apply to all the policies that define a Policy Module.  The only
     special behavior currently implemented forces the memory allocator to
     prioritize cache coloring over locality; it can be selected using the
     flag POLICY_CACHE_COLOR_FIRST.  For example:

         policy_set.placement_policy_name   = "PlacementFixed";
         policy_set.placement_policy_args   = (void *)mld_handle;
         policy_set.fallback_policy_name    = "FallbackDefault";
         policy_set.fallback_policy_args    = NULL;
         policy_set.replication_policy_name = "ReplicationDefault";
         policy_set.replication_policy_args = NULL;
         policy_set.migration_policy_name   = "MigrationDefault";
         policy_set.migration_policy_args   = NULL;
         policy_set.paging_policy_name      = "PagingDefault";
         policy_set.paging_policy_args      = NULL;
         policy_set.page_size               = PM_PAGESZ_DEFAULT;
         policy_set.page_wait_timeout       = 0;
         policy_set.policy_flags            = POLICY_CACHE_COLOR_FIRST;

     This example fills in the policy_set_t structure to create a PM with
     a placement policy called "PlacementFixed", which takes a Memory
     Locality Domain (MLD) as an argument.  All other policies are set to
     the defaults, including the page size.  We also ask for cache
     coloring to be given precedence over locality.

     Since filling in this structure with mostly default values is a
     common operation, a special call is provided to pre-fill it with
     default values:

         void pm_filldefault(policy_set_t* policy_set);
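     As an illustration, a minimal sketch combining pm_filldefault and
     pm_create.  The MMCI declarations are assumed to come from
     <sys/pmo.h>, as with topology_type_t later in this document; only the
     page size is overridden here:

         #include <stdio.h>
         #include <sys/types.h>
         #include <sys/pmo.h>

         policy_set_t policy_set;
         pmo_handle_t pm_handle;

         /* Start from all-default policies, then request 64KB pages. */
         pm_filldefault(&policy_set);
         policy_set.page_size = 64 * 1024;

         pm_handle = pm_create(&policy_set);
         if (pm_handle < 0)
                 perror("pm_create");   /* errno holds the error code */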
     The pm_create call returns a handle to the Policy Module just
     created, or a negative long integer in case of error, in which case
     errno is set to the corresponding error code.

     The handle returned by pm_create is of type pmo_handle_t.  The
     acronym PMO stands for Policy Management Object.  This type is common
     to all handles returned by all the Memory Management Control
     Interface calls.  These handles are used to identify the different
     memory control objects created for an address space, much the same
     way file descriptors are used to identify open files or devices.
     Every address space contains one independent PMO table.  A new table
     is created only when a process execs.

     A simpler way to create a Policy Module is to use the restricted
     Policy Module creation call:

         pmo_handle_t pm_create_simple(char* plac_name,
                                       void* plac_args,
                                       char* repl_name,
                                       void* repl_args,
                                       size_t page_size);

     This call allows for the specification of only the Placement Policy,
     the Replication Policy, and the Page Size.  Defaults are
     automatically chosen for the Fallback Policy, the Migration Policy,
     and the Paging Policy.
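     For instance, a sketch of creating a Policy Module with first-touch
     placement, default replication, and the default page size
     (PM_PAGESZ_DEFAULT is the same constant used in the example above):

         pmo_handle_t pm_handle;

         /* Placement and replication only; fallback, migration and
          * paging policies are filled in with their defaults.
          */
         pm_handle = pm_create_simple("PlacementFirstTouch", NULL,
                                      "ReplicationDefault", NULL,
                                      PM_PAGESZ_DEFAULT);
         if (pm_handle < 0)
                 perror("pm_create_simple");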
   Association of Virtual Address Space Sections
     The Memory Management Control Interface allows users to select
     different policies for different sections of a virtual address space,
     down to the granularity of a page.  To associate a virtual address
     space section with a set of policies, users first create a Policy
     Module with the desired policies, as described in the previous
     section, and then use the following MMCI call:

         int pm_attach(pmo_handle_t pm_handle,
                       void* base_addr,
                       size_t length);

     The pm_handle argument identifies the Policy Module the user has
     previously created, base_addr is the base virtual address of the
     virtual address space section the user wants to associate with the
     set of policies, and length is the length of the section.

     All physical memory allocated on behalf of a virtual address space
     section with a newly attached Policy Module follows the policies
     specified by that Policy Module.  Physical memory that has already
     been allocated is not affected until the page is either migrated or
     swapped out to disk and then brought back into memory.

     Only address space mappings that exist at the time of the call are
     affected.  For example, if a file is later memory-mapped into a
     virtual address space section that a Policy Module was previously
     associated with via pm_attach, the default policies will be applied
     to the new mapping rather than those specified by the pm_attach call.
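     As a sketch, attaching a previously created Policy Module to an
     existing mapping (buf and buflen are hypothetical placeholders for a
     region already mapped into the address space):

         /* Memory faulted in under [buf, buf + buflen) from now on
          * follows the policies in pm_handle; already-resident pages
          * are unchanged.
          */
         if (pm_attach(pm_handle, buf, buflen) < 0)
                 perror("pm_attach");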
   Default Policy Module
     A new Default Policy Module is created and inserted in the PMO Name
     Space every time a process execs.  This Default PM is used to define
     memory management policies for all freshly created memory regions; it
     can later be overridden by users via the pm_attach MMCI call.  The
     Default Policy Module is created with the policies listed below:

     * PlacementDefault
     * FallbackDefault
     * ReplicationDefault
     * MigrationDefault
     * PagingDefault
     * Page size: 16KB
     * Flags: 0

     The Default Policy Module is used in the following situations:

     - At exec time, when the basic memory regions for the stack, text,
       and heap are created.

     - At fork time, when all the private memory regions are created.

     - At sproc time, when all the private memory regions are created (at
       least the stack, when the complete address space is shared).

     - When mmapping a file or a device.

     - When growing the stack, if the stack's region has been removed by
       the user via munmap, or the user has done a setcontext, moving the
       stack to a new location.

     - When sbreaking, if the user has removed the associated region using
       munmap, or the region was not growable, anonymous, or
       copy-on-write.

     - When a process attaches a portion of the address space of a
       "monitored" process via procfs, and a new region needs to be
       created.

     - When a user attaches a SYSV shared memory region.

     The Default Policy Module is also stored in the per-process-group PMO
     Name Space, and therefore follows the same inheritance rules as all
     Policy Modules: it is inherited at fork or sproc time, and a new one
     is created at exec time.

     Users can select a new default Policy Module for the stack, text, or
     heap:

         pmo_handle_t pm_setdefault(pmo_handle_t pm_handle,
                                    mem_type_t mem_type);

     The pm_handle argument is the handle returned by pm_create.  The
     mem_type argument identifies the memory section whose default Policy
     Module the user wants to change; it can take any of the following
     three values:

     - MEM_STACK
     - MEM_TEXT
     - MEM_DATA

     Users can also obtain a handle to the default PM using the following
     call:

         pmo_handle_t pm_getdefault(mem_type_t mem_type);

     This call returns a PMO handle referring to the calling process's
     address space default PM for the specified memory type.  The handle
     is greater than or equal to zero when the call succeeds; it is less
     than zero when the call fails, and errno is set to the appropriate
     error code.
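     A sketch of swapping the heap's default PM follows.  Here pm_handle
     is assumed to be a Policy Module created earlier with pm_create, and
     the negative-return error convention of the other MMCI calls is
     assumed for pm_setdefault as well:

         pmo_handle_t old_pm, new_pm;

         /* Remember the current default PM for the heap, then install
          * the new one; subsequently created heap regions will use the
          * policies in pm_handle.
          */
         old_pm = pm_getdefault(MEM_DATA);
         if (old_pm < 0)
                 perror("pm_getdefault");

         new_pm = pm_setdefault(pm_handle, MEM_DATA);
         if (new_pm < 0)
                 perror("pm_setdefault");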
   Destruction of a Policy Module
     Policy Modules are automatically destroyed when all the members of a
     process group or a shared group have died.  However, users can
     explicitly ask the operating system to destroy Policy Modules that
     are no longer in use, using the following call:

         int pm_destroy(pmo_handle_t pm_handle);

     The pm_handle argument is the handle returned by pm_create.  Any
     association to this PM that already exists will remain effective; the
     PM will only be destroyed when the section of the address space
     associated with it is also destroyed (unmapped), or when the
     association is overridden via a pm_attach call.

   Policy Status of an Address Space
     Users can obtain the list of Policy Modules currently associated with
     a section of a virtual address space using the following call:

         typedef struct pmo_handle_list {
                 pmo_handle_t* handles;
                 uint          length;
         } pmo_handle_list_t;

         int pm_getall(void* base_addr,
                       size_t length,
                       pmo_handle_list_t* pmo_handle_list);

     The base_addr argument is the base address of the section the user is
     inquiring about, length is the length of the section, and
     pmo_handle_list is a pointer to a list of handles as defined by the
     pmo_handle_list_t structure.

     On success, this call returns the effective number of PMs that are
     being used by the specified virtual address space range.  If this
     number is greater than the size of the list used as a container for
     the PM handles, the user can infer that the specified virtual address
     space range is using more PMs than fit in the list.  On failure, this
     call returns a negative integer, and errno is set to the
     corresponding error code.

     Users also have read-only access to the internal details of a PM,
     using the following call:

         typedef struct pm_stat {
                 char          placement_policy_name[PM_NAME_SIZE + 1];
                 char          fallback_policy_name[PM_NAME_SIZE + 1];
                 char          replication_policy_name[PM_NAME_SIZE + 1];
                 char          migration_policy_name[PM_NAME_SIZE + 1];
                 char          paging_policy_name[PM_NAME_SIZE + 1];
                 size_t        page_size;
                 int           policy_flags;
                 pmo_handle_t  pmo_handle;
         } pm_stat_t;

         int pm_getstat(pmo_handle_t pm_handle, pm_stat_t* pm_stat);

     The pm_handle argument identifies the PM the user needs information
     about, and pm_stat is an out parameter of the form defined by the
     pm_stat_t structure.  On success, this call returns a non-negative
     integer and the PM internal data in pm_stat.  On error, it returns a
     negative integer, and errno is set to the corresponding error code.
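     As an illustration, a sketch that lists the PMs covering a region and
     prints the placement policy of each (buf and buflen are hypothetical
     placeholders for an existing mapping; <stdio.h> is assumed for
     printf):

         pmo_handle_t      handles[16];
         pmo_handle_list_t list;
         pm_stat_t         stat;
         int               npm, i;

         list.handles = handles;
         list.length  = 16;

         /* npm is the total number of PMs in use for the range; only
          * the first 16 handles fit in the list we supplied.
          */
         npm = pm_getall(buf, buflen, &list);
         if (npm < 0)
                 perror("pm_getall");

         for (i = 0; i < npm && i < 16; i++) {
                 if (pm_getstat(handles[i], &stat) < 0)
                         continue;
                 printf("PM %d: placement policy %s, page size %lu\n",
                        i, stat.placement_policy_name,
                        (unsigned long)stat.page_size);
         }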
   Setting the Page Size
     Users can modify the page size of a PM using the following MMCI call:

         int pm_setpagesize(pmo_handle_t pm_handle, size_t page_size);

     The pm_handle argument identifies the PM whose page size the user is
     changing, and the page_size argument is the requested page size.
     This call changes the page size for the PMs associated with the
     specified section of virtual address space, so that newly allocated
     memory will use the new page size.  On success, this call returns a
     non-negative integer; on error, it returns a negative integer, and
     errno is set to the corresponding error code.
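     For example, a sketch requesting 1MB pages for future allocations
     under a PM (pm_handle is assumed to come from pm_create or
     pm_getall):

         /* Already-resident pages keep their size; only memory
          * allocated from now on uses 1MB pages, subject to
          * availability and the fallback policy.
          */
         if (pm_setpagesize(pm_handle, 1024 * 1024) < 0)
                 perror("pm_setpagesize");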
   Locality Management
     One of the most important goals of memory management in a CCNUMA
     system like the Origin 2000 is the maximization of locality.  IRIX
     uses several mechanisms to manage locality:

     - IRIX implements dynamic memory migration to automatically attract
       memory to those processes that are making the heaviest use of a
       page of memory.

     - IRIX replicates read-only memory sections, such as application and
       library code, in order to maximize local memory accesses and avoid
       interconnect contention.

     - IRIX schedules memory in such a way that applications can allocate
       large amounts of relatively close memory pages.

     - IRIX does topology-aware initial memory placement.

     - IRIX provides a topology-aware process scheduler that integrates
       cache and memory affinity into the scheduling algorithms.

     - IRIX allows and encourages application writers to provide initial
       placement hints, using high-level tools, compiler directives, or
       direct system calls.

     - IRIX allows users to select different policies for the most
       important memory management operations.

   The Placement Policy
     The Placement Policy defines the algorithm used by the physical
     memory allocator to decide which memory source to use to allocate a
     page in a multi-node CCNUMA machine.  The goal of this algorithm is
     to place memory in such a way that local accesses are maximized.

     The optimal placement algorithm would have knowledge of the exact
     number of cache misses that will be caused by each thread sharing the
     page to be placed.  Using this knowledge, the algorithm would place
     the page on the node currently running the thread that will generate
     the most cache misses, assuming that the thread always stays on the
     same node.  Unfortunately, we do not have perfect knowledge of the
     future, so the algorithm has to be based on heuristics that predict
     the memory access patterns and cache misses on a page, or on
     user-provided hints.

     All placement policies are based on two abstractions of physical
     memory nodes:

     - Memory Locality Domains (MLDs)
     - Memory Locality Domain Sets (MLDsets)

   Memory Locality Domains
     A Memory Locality Domain or MLD with center c and radius r is a
     source of physical memory composed of all memory nodes within a "hop
     distance" r of a center node c.  Normally, MLDs have radius 0,
     representing a single node.  MLDs may be interpreted as virtual
     memory nodes.

     Normally the application writer defining MLDs specifies the MLD
     radius and lets the operating system decide where it will be
     centered.  The operating system tries to choose a center according to
     current memory availability and other placement parameters that the
     user may have specified, such as device affinity and topology.

     Users can create MLDs using the following MMCI call:

         pmo_handle_t mld_create(int radius, long size);

     The radius argument defines the MLD radius, and the size argument is
     a hint specifying approximately how much physical memory will be
     required for this MLD.  On success, this call returns a handle for
     the newly created MLD.  On error, it returns a negative long integer,
     and errno is set to the corresponding error code.

     MLDs are not placed when they are created.  The MLD handle returned
     by the constructor cannot be used until the MLD has been placed by
     making it part of an MLDset.

     Users can also destroy MLDs that are no longer in use, using the
     following call:

         int mld_destroy(pmo_handle_t mld_handle);

     The mld_handle argument is the handle returned by mld_create.  On
     success, this call returns a non-negative integer.  On error, it
     returns a negative integer, and errno is set to the corresponding
     error code.

   Memory Locality Domain Sets
     Memory Locality Domain Sets or MLDsets address the issues of
     placement topology and device affinity.  Users can create MLDsets
     using the following MMCI call:

         pmo_handle_t mldset_create(pmo_handle_t* mldlist,
                                    int mldlist_len);

     The mldlist argument is an array of MLD handles containing all the
     MLDs the user wants to make part of the new MLDset, and the
     mldlist_len argument is the number of MLD handles in the array.  On
     success, this call returns an MLDset handle.  On error, it returns a
     negative long integer, and errno is set to the corresponding error
     code.

     This call only creates a basic MLDset without any placement
     information.  An MLDset in this state is useful just to specify
     groups of MLDs that have already been placed.  In order to have the
     operating system place this MLDset, and therefore place all the MLDs
     that are now members of it, users have to specify the desired MLDset
     topology and device affinity, using the following MMCI call:

         int mldset_place(pmo_handle_t mldset_handle,
                          topology_type_t topology_type,
                          raff_info_t* rafflist,
                          int rafflist_len,
                          rqmode_t rqmode);

     The mldset_handle argument is the MLDset handle returned by
     mldset_create, and identifies the MLDset the user is placing.  The
     topology_type argument specifies the topology the operating system
     should consider in order to place this MLDset.  It can be one of the
     following:

     TOPOLOGY_FREE
          This topology specification lets the operating system decide
          what shape to use to allocate the set.  The operating system
          will try to place this MLDset on a cluster of physical nodes as
          compact as possible, depending on the current system load.

     TOPOLOGY_CUBE
          This topology specification is used to request a cube-like
          shape.

     TOPOLOGY_CUBE_FIXED
          This topology specification is used to request a physical cube.

     TOPOLOGY_PHYSNODES
          This topology specification is used to request that the MLDs in
          an MLDset be placed on the exact physical nodes enumerated in
          the device affinity list, described below.

     TOPOLOGY_CPUCLUSTER
          This topology specification is used to request the placement of
          one MLD per CPU, instead of the default of one MLD per node.  On
          an Origin 3000, a fully populated node has 4 CPUs, so each node
          can have up to 4 MLDs placed on it.  On a node with fewer than
          the maximum number of CPUs, the number of MLDs placed on that
          node will not exceed the actual number of CPUs.  Also, if
          cpusets are in use, the MLDs will be placed on nodes that are
          part of the defined cpuset.  This topology is useful when the
          placement policy is managing cache coloring relative to MLDs
          instead of virtual memory regions.

     The topology_type_t type shown below is defined in <sys/pmo.h>.

         /*
          * Topology types for mldsets
          */
         typedef enum {
                 TOPOLOGY_FREE,
                 TOPOLOGY_CUBE,
                 TOPOLOGY_CUBE_FIXED,
                 TOPOLOGY_PHYSNODES,
                 TOPOLOGY_CPUCLUSTER,
                 TOPOLOGY_LAST
         } topology_type_t;

     The rafflist argument is used to specify resource affinity.  It is an
     array of resource specifications using the structure shown below:

         /*
          * Specification of resource affinity.
          * The resource is specified via a
          * file system name (dev, file, etc).
          */
         typedef struct raff_info {
                 void*   resource;
                 ushort  reslen;
                 ushort  restype;
                 ushort  radius;
                 ushort  attr;
         } raff_info_t;

     The resource, reslen, and restype fields define the resource.  The
     resource field is used to specify the name of the resource, the
     reslen field must always be set to the actual number of bytes the
     resource pointer points to, and the restype field specifies the kind
     of resource identification being used, which can be any of the
     following:

     RAFFIDT_NAME
          This resource identification type should be used when a hardware
          graph path name is used to identify the device.

     RAFFIDT_FD
          This resource identification type should be used when a file
          descriptor is being used to identify the device.

     The radius field defines the maximum distance from the actual
     resource at which the user would like the MLDset to be placed.  The
     attr field specifies whether the user wants the MLDset to be placed
     close to or far from the resource:

     RAFFATTR_ATTRACTION
          The MLDset should be placed as close as possible to the
          specified device.

     RAFFATTR_REPULSION
          The MLDset should be placed as far as possible from the
          specified device.

     The rafflist_len argument in the mldset_place call specifies the
     number of raff structures the user is passing via rafflist.  There
     must be at least as many raff structures passed as the size of the
     corresponding MLDset, or the operation will fail and EINVAL will be
     returned.

     Finally, the rqmode argument is used to specify whether the placement
     request is advisory or mandatory:

         /*
          * Request types
          */
         typedef enum {
                 RQMODE_ADVISORY,
                 RQMODE_MANDATORY
         } rqmode_t;

     The operating system places the MLDset by finding a section of the
     machine that meets the requirements of topology, device affinity, and
     expected physical memory use.

     The mldset_place call returns a non-negative integer on success.  On
     error, it returns a negative integer, and errno is set to the
     corresponding error code.

     Users can destroy MLDsets using the following call:

         int mldset_destroy(pmo_handle_t mldset_handle);

     The mldset_handle argument identifies the MLDset to be destroyed.  On
     success, this call returns a non-negative integer.  On error, it
     returns a negative integer, and errno is set to the corresponding
     error code.

   Linking Execution Threads to MLDs
     After creating MLDs and placing them using an MLDset, a user can
     create a Policy Module that makes use of these memory sources, and
     attach sections of a virtual address space to this Policy Module.  We
     still need to make sure that the application threads will be executed
     on the nodes where the memory is being allocated.  To ensure this,
     users need to link threads to MLDs using the following call:

         int process_mldlink(pid_t pid, pmo_handle_t mld_handle);

     The pid argument is the pid of the process to be linked to the MLD
     specified by the mld_handle argument.  On success, this call returns
     a non-negative integer.  On error, it returns a negative integer, and
     errno is set to the corresponding error code.

     This call sets up a hint for the process scheduler.  The process
     scheduler is not required to always run the process on the node
     specified by the MLD; it may decide to temporarily use CPUs on
     different nodes to execute threads in order to maximize resource
     utilization.
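     Putting the pieces together, the following sketch creates and places
     two single-node MLDs, links the calling process to one of them, and
     allocates a region's memory from that node.  This is a minimal
     illustration, not a complete program: error checking is abbreviated,
     buf and buflen are hypothetical placeholders for an existing mapping,
     and a null rafflist is passed on the assumption that resource
     affinity is optional when no device affinity is requested.

         #include <unistd.h>
         #include <sys/types.h>
         #include <sys/pmo.h>

         pmo_handle_t mlds[2];
         pmo_handle_t mldset;
         pmo_handle_t pm;
         policy_set_t policy_set;

         /* Two virtual memory nodes (radius 0), each expected to
          * supply about 8MB of physical memory.
          */
         mlds[0] = mld_create(0, 8 * 1024 * 1024);
         mlds[1] = mld_create(0, 8 * 1024 * 1024);

         /* Group the MLDs and let the OS choose a compact placement. */
         mldset = mldset_create(mlds, 2);
         if (mldset_place(mldset, TOPOLOGY_FREE, NULL, 0,
                          RQMODE_ADVISORY) < 0)
                 perror("mldset_place");

         /* Hint the scheduler to run this process near mlds[0]. */
         process_mldlink(getpid(), mlds[0]);

         /* Allocate the region's memory from mlds[0] as well. */
         pm_filldefault(&policy_set);
         policy_set.placement_policy_name = "PlacementFixed";
         policy_set.placement_policy_args = (void *)mlds[0];
         pm = pm_create(&policy_set);
         pm_attach(pm, buf, buflen);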
   Name Spaces For Memory Management Control
     - The Policy Name Space.  This is a global system name space that
       contains all the policies that have been exported and are therefore
       available to users.  The domain of this name space is the set of
       exported policy names, strings of characters such as
       "PlacementDefault"; its range is the corresponding set of policy
       constructors.  When users create a Policy Module, they have to
       specify all selectable policies by name.  Internally, the operating
       system searches for each name in the Policy Name Space, thereby
       getting hold of the constructors for each of the specified
       policies, which are used to initialize the actual internal policy
       modules.

     - The Policy Management Object Name Space.  This is a per-process-
       group name space, either shared (sprocs) or not shared (forks),
       used to store handles for all the Policy Management Objects that
       have been created within the context of any of the members of the
       process group.  The domain of this name space is the set of Policy
       Management Object (PMO) handles; its range is the set of references
       (internal kernel pointers) to the PMOs.  PMO handles can refer to
       any of several kinds of Policy Management Objects:

       - Policy Modules
       - Memory Locality Domains (MLDs)
       - Memory Locality Domain Sets (MLDsets)

       The PMO Name Space is inherited at fork or sproc time, and created
       at exec time.

SEE ALSO
     numa(5), migration(5), mtune(4), /var/sysgen/mtune/numa, refcnt(5),
     replication(5), nstats(1), sn(1), mld(3c), mldset(3c), pm(3c),
     migration(3c), pminfo(3c), numa_view(1), dplace(1), dprof(1).